Weather Data Analysis: A Regression and Classification Approach on the ERA5 Dataset
Course: Data Analytics with Statistics | Lecturer: Prof. Dr. Jan Kirenz | Names: Julian Erath, Furkan Saygin, Sofie Pischl | Group: B
Weather, an age-old Earth phenomenon, captivates human interest due to its intricate blend of temperature, wind, and precipitation, molding our surroundings and challenging our understanding of the natural world [^1]. Accurate weather prediction is crucial for agriculture, disaster management, and urban planning, particularly in the context of climate change risks [^2]. The project, titled "Weather Data Analysis: A Regression and Classification Approach on the ERA5 Dataset" aims to contribute to this exploration by examining how different variables interact to create complex weather phenomena.
Data description of sample
The study leverages the ERA5 dataset, sourced from the European Centre for Medium-Range Weather Forecasts (ECMWF). It comprises atmospheric reanalysis data spanning several years (2015-2022) at hourly intervals and is characterized by a spatial resolution of approximately 31 km [^3]. Focusing on the region of Bancroft in Ontario, Canada, the project explores the unique climatic and meteorological characteristics of the area, influenced by the 'lake-effect' phenomenon [^4]. Various meteorological parameters, as described below, are included in the dataset. The data, labeled by meteorologists and data scientists from IBM and The Weather Company, offers comprehensive global-scale atmospheric information, making it well-suited for detailed analyses and modeling, including climate research, environmental monitoring, and weather forecasting [^5], [^6].
Variables
The dataset encompasses key variables such as air temperature, wind speed and direction, precipitation (rainfall and snowfall), atmospheric pressure, snow density, cumulative snow, cumulative ice, and weather events. The dataset also includes categorical weather events such as Blue Sky Day, Mild Snowfall, and Storm with Freezing Rain. These variables form the foundation for the assignment's comprehensive analysis [^7].
Overview of data
Initially, the .csv file is loaded, and the data's head is printed for an initial overview of columns (variables) and rows (observations), as can be seen in appendix 5.2 "Display of the Used Dataframe". The dataset comprises 65,345 observations and 184 columns, including unique predictor variables and a response variable. A new dataframe is formed by selecting specific columns and transforming columns to achieve optimized resource usage. This dataframe is later split into training, testing, and validation sets, underlining the foundational role of proper data splitting for reliable machine learning model development and generalization to new data [^8], [^9].
The research is guided by several pivotal questions, addressed through regression and classification analyses.
Regression Hypothesis: There exists a significant correlation between temperature and wind characteristics, which can be modeled to predict future temperature trends and variations. This hypothesis is based on the premise that atmospheric variables are interconnected and can be analyzed to forecast weather conditions. The hypothesis will be examined through the following questions: Is it possible to build an accurate regression model to predict temperature based on historical data? Is it possible to find a correlation or causation between the temperature and the wind features using regression techniques? How does the incorporation of multiple atmospheric predictors enhance the accuracy of temperature prediction compared to a model solely based on windspeed? Can logistic regression effectively classify and predict the occurrence of extreme or normal weather events based on temperature ranges?
Classification Hypothesis: Specific patterns in the weather data can accurately predict various weather events, including extreme conditions. This hypothesis is informed by the need for effective prediction models in the face of increasingly frequent and severe weather events. The following questions will help to evaluate this hypothesis: Is it possible to classify and predict extreme weather events such as storms? Is it possible to categorize and predict different extreme weather events based on multivariate weather data?
The dataset includes features like the substation (Bancroft), timestamps, weather-related parameters, and various labels for the corresponding weather events. As revealed in the appendix 5.3 "Data Dictionary", most variables are of the "float64" data type (166), 8 variables are of type "int," and 10 are of type "object". First, the variable "avg_temp" is examined. This includes depicting the temperature trend over time (seen in 5.4 "Time series"), as well as displaying the box plot and histogram as shown in appendix 5.8 "Distribution of Weather Features by Weather Event Profiles in Distograms" and 5.9 "Distribution of Weather Features by Weather Event Profiles in Boxplots". Next, these statistics and their occurrences are considered for each weather event profile.
The first phase of the methodology focused on the comprehensive preparation and processing of the ERA5 dataset to ensure a solid basis for the subsequent analysis. This phase aimed to ensure data quality and maximize the accuracy of the models.
Data acquisition
First, the ERA5 dataset was imported and inspected. For this purpose, the first lines and the metadata of the dataset were examined in order to prepare the dataset for further analysis. This process included the selection of relevant meteorological variables and the creation of new variables that were important for our analysis purposes.
First, the date and time information in the dataset is converted into a standardized date format. Next, the average temperature is converted from Kelvin to Celsius. Finally, the wind direction data, which was in degrees, is converted to cardinal directions (such as northeast, east, etc.) to make the data clearer and the interpretation easier to understand.
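These three conversions could be sketched as below; the sin/cos encoding mirrors the variable list that follows, while the eight-sector binning of wind direction is an assumed implementation detail:

```python
import numpy as np
import pandas as pd

def add_derived_columns(df):
    """Standardize timestamps, convert Kelvin to Celsius, and map wind
    direction in degrees onto eight cardinal/intercardinal labels."""
    df = df.copy()
    df["run_datetime"] = pd.to_datetime(df["run_datetime"])
    df["avg_temp_celsius"] = df["avg_temp"] - 273.15
    # sin/cos encoding preserves the circular nature of wind direction
    rad = np.deg2rad(df["avg_winddir"])
    df["avg_winddir_sin"] = np.sin(rad)
    df["avg_winddir_cos"] = np.cos(rad)
    labels = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    sector = ((df["avg_winddir"] + 22.5) // 45).astype(int) % 8
    df["wind_direction_label"] = [labels[i] for i in sector]
    return df

demo = pd.DataFrame({"run_datetime": ["2015-01-01 00:00"],
                     "avg_temp": [273.15], "avg_winddir": [45.0]})
out = add_derived_columns(demo)
print(out["avg_temp_celsius"].iloc[0], out["wind_direction_label"].iloc[0])  # → 0.0 NE
```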
As a result of this phase, the dataframe with the following variables was considered for further analysis: Date and the time (run_datetime), Weather event type (wep), Average temperature (avg_temp), Average temperature in Celsius (avg_temp_celsius), Minimum wet bulb temperature (min_wet_bulb_temp), Average dew point (avg_dewpoint), Temperature change (avg_temp_change), Average wind speed (avg_windspd), Maximum wind gust (max_windgust), Average wind direction (avg_winddir), Sine of wind direction (avg_winddir_sin), Cosine of wind direction (avg_winddir_cos), Cardinal wind direction (wind_direction_label), Maximum cumulative precipitation (max_cumulative_precip), Maximum snow density (max_snow_density_6), Maximum cumulative snow (max_cumulative_snow), Maximum cumulative ice (max_cumulative_ice), Average pressure change (avg_pressure_change), Additional WEP labels (label0, label1, label2)
The second phase of methodology comprises an in-depth analysis and visualization of the data in order to gather insights that could be decisive for the objective.
A closer look at the weather event types has shown that 'blue sky day' is the most common with a frequency of 42,106, followed by mild snowfall with 3,598, moderate snowfall with 2,336, and moderate rainfall with 2,104. Extreme weather events are comparatively rare: storm with freezing rain / heavy snow- and icestorm occurred only 69 times, continuous freezing rain 37 times, storm with freezing rain / heavy snow- and icestorm 17 times, and snowstorm with high precipitation 10 times (appendix 5.6 "Distribution of All Weather Events"). The temporal component is then examined by plotting unique variables over time.
A frequency distribution of the weather parameters in the form of a histogram allows additional insights to be gained using statistical methods:
A boxplot offers the possibility to examine parameters for important statistical key figures such as median, quartiles, interquartile range (IQR), outliers, and distribution. It was created for all of the weather parameters mentioned and confirms the results of the previous investigations of the time series and the histogram: clear seasonal fluctuations in temperature with low daily variability; mostly low wind speeds with occasional peaks; mostly low precipitation with rare severe outliers; rare snow and ice accumulations; and mostly stable atmospheric pressure. These results indicate a climate that is subject to regular seasonal changes, with occasional extreme weather events.
Reversing the view and grouping the data into weather events provides insights into the influence of the individual weather parameters on the selected weather events. The distograms of the weather data from Bancroft reveal several key patterns:
The same was visualized again as a box plot and produced the following findings:
Certain weather events are rarer than others, which is why it is necessary to look at the distribution. A pie chart that first shows all weather events and then the extreme weather events has produced the following results:
Associations and correlations between various meteorological parameters were then investigated. For this purpose, scatterplots were created for all pairs of relevant parameters and the correlation coefficients were calculated. The most important findings:
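The pairwise-correlation step can be sketched with pandas' Pearson `corr()`; the synthetic three-column frame below stands in for the much larger real parameter set:

```python
import pandas as pd

def strongest_correlations(df):
    """Pairwise Pearson correlations between numeric parameters, sorted by
    absolute strength -- the tabular counterpart of the scatterplot grid."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = [(a, b, corr.loc[a, b])
             for i, a in enumerate(cols) for b in cols[i + 1:]]
    return sorted(pairs, key=lambda t: -abs(t[2]))

# Synthetic demo: temperature and dew point move together, wind is unrelated.
demo = pd.DataFrame({"avg_temp": [0, 1, 2, 3, 4],
                     "avg_dewpoint": [0, 1, 2, 3, 4],
                     "avg_windspd": [5, 1, 4, 2, 3]})
a, b, r = strongest_correlations(demo)[0]
print(a, b, round(r, 2))  # → avg_temp avg_dewpoint 1.0
```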
Inclusion of temperature and wind parameters in regression analyses: Although no direct correlation was found between temperature and wind parameters, they were nevertheless included in the regression analyses. This is based on the assumption that their relationship may be non-linear or influenced by other factors that are not captured by linear correlation.
The results show that a comprehensive consideration of correlations and associations between meteorological parameters is necessary to understand complex interactions and the influence of extreme weather events on these relationships. The study suggests that a combination of linear and non-linear analysis methods is required for a complete understanding of atmospheric dynamics in Bancroft.
In a further analysis, the relationship between wind direction, wind speed, and average temperature was investigated. The main results are:
Interpretation: The study shows that wind speed and direction vary and are associated with different temperatures. Southwesterly winds tend to correlate with warmer temperatures, while northerly winds bring colder air masses. Higher wind speeds in northerly and westerly winds indicate stronger wind events or a general tendency towards higher wind speeds from these directions.
The results make it clear that the wind direction has a significant influence on wind speed and temperature. These findings are important for weather forecasting and areas such as energy production, where wind energy and temperature management are crucial factors. The analysis suggests that there is a correlation or even a causal link between wind and temperature, as southerly winds are milder and warmer, while northerly winds are stronger and colder.
A transition is now made to a more abstract yet insightful method: Principal Component Analysis (PCA). The multi-dimensional weather data will be condensed into three principal components, providing a visual exploration of the intrinsic structure and variability of the data. By plotting this 3D PCA scatter plot, hidden patterns, clusters, or anomalies across the weather events are anticipated to be uncovered.
The PCA analysis (Principal Component Analysis) of the weather data set provided the following key findings:
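The reduction itself might be sketched as follows; standardization is assumed (the weather parameters mix units such as degrees Celsius, m/s, and pressure), and random data stands in for the numeric feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random data stands in for the numeric weather feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))

# Standardize first: PCA is scale-sensitive.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
components = pca.fit_transform(X_std)

print(components.shape)  # → (500, 3)
print(pca.explained_variance_ratio_)  # share of variance per component
# components[:, 0..2] would feed the 3D scatter plot, colored by weather event.
```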
The first step is to select a suitable model. A linear regression, a gradient boosting, an SGD regressor and a support vector regressor were trained for this purpose. All variables that were also considered for the EDA were used as predictors. The response variable in this case is the average wind speed. The ratio of the training data to the test data was set at 80 to 20. The models were evaluated according to the Mean Squared Error (MSE) and the Mean Absolute Error (MAE), with lower values indicating a more accurate model. All models showed similar results, which is why the results were visualized.
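The four-model comparison on an 80/20 split can be sketched with scikit-learn; synthetic predictors stand in for the EDA variables, and default hyperparameters are an assumption:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic predictors and response stand in for the real weather features.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.0]) + rng.normal(scale=0.1, size=400)

# 80/20 train-test split, as in the report.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=1)

models = {"Linear Regression": LinearRegression(),
          "Gradient Boosting": GradientBoostingRegressor(random_state=1),
          "SGD Regressor": SGDRegressor(random_state=1),
          "Support Vector Regressor": SVR()}
scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = (mean_squared_error(y_te, pred),
                    mean_absolute_error(y_te, pred))  # lower = more accurate
for name, (mse, mae) in scores.items():
    print(f"{name}: MSE={mse:.3f}, MAE={mae:.3f}")
```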
Key Insights:
The goal is to use multiple predictor variables to predict the temperature variable with improved accuracy. First, it is important to select features that provide insight into the temperature variable but are not too strongly correlated with each other. A correlation matrix is used to determine the correlation between all variables. The correlation analysis of the weather dataset yielded the following findings:
To answer the objective, the target variable is now changed and an attempt is made to predict the average wind direction with all available variables. Again, the same metrics (MSE, MAE and R-squared) are used to obtain comparable results. The results are discussed in the results chapter.
A SARIMAX model is built that visualizes the time series, trend, seasonality, and residuals of the parameters avg_temp, avg_winddir, avg_windspd, and avg_windgust. Afterwards, an Augmented Dickey-Fuller test is performed, which is essential for determining whether the time series data for avg_temp, avg_winddir, avg_windspd, and avg_windgust are stationary. This ensures the validity of the SARIMAX model, as non-stationary data can significantly impact the model's accuracy and predictive capabilities. The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are used for SARIMAX model selection by balancing model fit and complexity. Lower AIC and BIC values indicate models that effectively capture the data while maintaining simplicity, guiding the choice of the most appropriate SARIMAX model for robust and generalizable forecasts [^18]. Determining the best ARIMA model parameters is crucial for the performance of the model. AutoARIMA automates ARIMA model selection for time series forecasting by optimizing parameters like (p, d, q) based on the AIC or BIC. For the hourly temperature dataset, parameters are set to capture daily seasonality (m=24) and balance model complexity (start_p=1, start_q=1, max_p=3, max_q=3, D=1). This approach aims to effectively model the data's patterns while avoiding overfitting. In the diagnostic phase, visualizing the SARIMAX model's diagnostics is crucial for assessing its fit to the temperature data. This process helps confirm the model's ability to accurately capture data patterns and validates key assumptions such as normal distribution of residuals and absence of autocorrelation, ensuring the model's robustness and reliability. The model is then fitted again with the revised parameters and evaluated on the test data set. The AIC, BIC, MSE, and SSE values are considered for the evaluation.
No single model is best for every dataset, which is why the best-working model for this project has to be found. LazyPredict streamlines the model comparison process by automatically assessing the performance of multiple models.
XGBRegressor was selected as the model for the next steps.
The next action involves fitting an XGB Regressor to predict average temperature from wind speed and direction features. TimeSeriesSplit is employed for cross-validation, maintaining the temporal sequence of observations. This method evaluates the model's predictive performance on unseen data, ensuring its effectiveness for future forecasting. The chosen hyperparameters for the XGB Regressor aim to optimize the balance between model complexity and accuracy.
Afterwards the results in MSE and MAE are calculated and a visualization of the actual and predicted value is created.
Temperature should correlate with time (time of day, day, season, etc.), which is why the next step is to build a regression model to predict temperature based on historical data. Therefore, linear regression, gradient boosting, an SGD regressor, and a support vector regressor are chosen for this task. This time the train-test split is performed by the year of the data. Afterwards, the models are fitted, evaluated by their residuals, and the actual versus predicted values are visualized for each model. The results will be discussed in the results chapter.
To predict extreme weather events with the temperature variable, the project uses logistic regression. First it is required to normalize the data and add a constant, which is done with statsmodels in this case. After fitting the data to the model, it can be evaluated using the AIC and the confusion matrix. A final visualization is helpful to understand the results and to find techniques to improve the performance.
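A scikit-learn sketch of the same idea follows (the report itself uses statsmodels, which additionally reports the AIC); the synthetic labels assume extreme events cluster at low temperatures:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: extreme events (label 1) concentrate at low temperatures.
rng = np.random.default_rng(4)
temp = rng.normal(loc=0.0, scale=10.0, size=1000)
p_extreme = 1.0 / (1.0 + np.exp(0.4 * (temp + 5.0)))  # colder -> more extreme
y = (rng.uniform(size=1000) < p_extreme).astype(int)

X = StandardScaler().fit_transform(temp.reshape(-1, 1))  # normalize, as in the report
clf = LogisticRegression().fit(X, y)
cm = confusion_matrix(y, clf.predict(X))
print(cm)  # rows: actual normal/extreme, columns: predicted
```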
The goal is to be able to predict extreme weather events from any variables. For that, the dataframe is first customized by dropping columns that are not numeric and not needed. After that, 'label1' is chosen as the dependent variable, and the LazyPredict library is used to find the best model for this case. ExtraTrees, XGBoost, LGBM, and RandomForest were identified as the best models, which is why all of them are implemented. The individual confusion matrix is used to evaluate model performance in combination with the precision, recall, and F1-score metrics.
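The binary-classification step might be sketched as below with the two scikit-learn ensemble models; XGBoost and LGBM would be added analogously where those libraries are installed, and the imbalanced synthetic data mimics the rarity of extreme events relative to blue sky days:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~5% positives (extreme weather events).
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.95],
                           random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=5)

preds = {}
for name, clf in [("ExtraTrees", ExtraTreesClassifier(random_state=5)),
                  ("RandomForest", RandomForestClassifier(random_state=5))]:
    preds[name] = clf.fit(X_tr, y_tr).predict(X_te)
    print(name)
    print(confusion_matrix(y_te, preds[name]))
    print(classification_report(y_te, preds[name], zero_division=0))
```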
Now it is important to classify which particular extreme weather event occurred. Other classifiers need to be trained; for this use case, KNN, SVM, DTC, and GBC were chosen. Again, the individual confusion matrix is used to evaluate model performance in combination with the precision, recall, and F1-score metrics.
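The four multiclass models can be trained and compared as follows; the six synthetic classes stand in for the specific extreme-weather labels, and default hyperparameters are an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Six synthetic classes stand in for the specific extreme weather event labels.
X, y = make_classification(n_samples=1200, n_features=10, n_informative=6,
                           n_classes=6, random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=6)

models = {"KNN": KNeighborsClassifier(),
          "SVM": SVC(),
          "DTC": DecisionTreeClassifier(random_state=6),
          "GBC": GradientBoostingClassifier(random_state=6)}
accs = {}
for name, clf in models.items():
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    accs[name] = accuracy_score(y_te, pred)
    print(f"{name}: accuracy={accs[name]:.2f}")
```

In the project itself, per-class confusion matrices and classification reports (precision, recall, F1-score) replace the single accuracy number shown here.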
Is there a significant correlation between temperature and wind characteristics, which can be modeled to predict future temperature trends and variations? This question was addressed within the scope of this project. Various regression techniques were employed, and different sub-questions were examined.
Temperature and Wind Modeling
In the first step, the relationship between wind speed and temperature is investigated. In this context, models such as the Linear Regression Model (LRM), Gradient Boosting Model, Stochastic Gradient Descent Model, and Support Vector Regression Model are utilized to depict the correlation (appendix 5.12 "Linear Regression Analysis Temperature and Wind Modeling Results"). These models predict temperature from wind speed using various regression techniques and are compared with each other. Results show a weak correlation and high MSE and MAE across all models, indicating poor prediction. Outliers and dispersed residuals suggest significant deviations. Support Vector Regression tends to underpredict. These findings suggest the need for multiple regression with additional variables. A subsequent linear regression analysis on wind gusts reinforces the idea that correlated variables may yield successful models but lack scientific value. Multiple regressor analysis is proposed to enhance temperature prediction due to the limited effectiveness of wind speed alone.
Linear Regression Analysis with Multiple Predictors
In the initial phase of the Temperature and Wind Modeling over Time analysis, a Multiple Linear Regression, as introduced in the lecture, is applied. Based on that, the temperature variable is now predicted with improved accuracy using linear regression with multiple predictor variables, addressing the research question of how the incorporation of various atmospheric predictors enhances temperature prediction over different time scales, uncovering interactions and synergies among predictors, and analyzing temporal dynamics to refine the predictive model. In the first step, the temperature is predicted from windspeed and wind direction. In the next step, the temperature is predicted using the previously utilized variables (appendix 5.13 "Linear Regression Analysis Multiple Predictors Correlation Matrix of Variables"). For this analysis, seasonality and trend for the temperature are also analysed (appendix 5.14 "Linear Regression Analysis Multiple Predictors Seasonality and Trend"). After implementing the Multiple Linear Regression (MLR) model, there can be a lack of accuracy in predicting average temperature from wind speed and direction, as well as from the remaining variables. The overall conclusion underscores the need for further refinement, potentially involving additional features or non-linear models, to enhance predictive accuracy, especially in accurately predicting extreme temperatures.
SARIMAX MODEL
After successfully predicting the temperature parameter through multiple predictor linear regression, the focus shifts to forecasting the temperature parameter with a statistical SARIMAX approach (appendix 5.15 "Linear Regression Analysis Multiple Predictors SARIMAX Forecast Results"). SARIMAX models are among the most widely used statistical models for forecasting, with excellent forecasting performance [^16]. To keep the model's complexity low and avoid lengthy computation times later on, only wind variables are used for an initial approach here. The analysis of Trend and Seasonality revealed a slight variability with some periods showing a gentle rise or fall and a consistent and expected cyclical pattern corresponding to the seasons. The augmented Dickey-Fuller Test (ADF) [^17], Akaike Information Criterion (AIC) [^18], and Bayesian Information Criterion (BIC) are performed on the data. The ADF Test indicated stationarity, the AIC and BIC showed that windspeed and winddirection are the most suitable predictors. After that, the actual SARIMAX Model is created. The evaluation reveals the model's limitations in capturing short-term fluctuations, particularly missing sharp peaks, and consistently overestimating temperatures, indicating a systematic bias and the need for further refinement or alternative modeling approaches to enhance accuracy.
XGBoost
After implementing the SARIMAX as a popular approach for time series analysis, the LazyRegressor from the LazyPredict library was utilized to find the best-performing regressor. The LazyRegressor showed that all regression models have a rather low R-squared value. The XGBoost Regressor is determined as the best-performing model with an R-squared value of 0.13. Based on that, the XGBoost model is used. The evaluation of the model shows a moderate level of predictive accuracy, with the model following the general temperature trend but exhibiting discrepancies in magnitude and timing, supported by reported Mean Squared Error (MSE) and Mean Absolute Error (MAE) values, suggesting potential for improvement through model tuning and additional feature exploration.
Temporal Prediction
In the next step, the relationship between temperature and time is explored. A Linear Regression Model, Gradient Boosting Regressor, an SGD Regressor, and a Support Vector Regressor are used here. The Evaluation of the plots presents that the Gradient Boosting Regressor demonstrates a promising ability to closely track temperature changes with fewer deviations and a tighter distribution of residuals, supporting the conclusion that linear regression models, while not perfect, can provide valuable forecasts for temperature trends in Bancroft, Canada. The results can be seen in appendix 5.16 "Linear Regression Analysis Prediction Forecast Results".
Temporal Logistic Regression
Logistic regression, placed between linear regression and classification chapters, serves as a bridge to better understand the data story, where blue dots represent actual labels, red dots indicate predicted probabilities, and the orange curve reflects the probability of extreme weather events based on temperature alone (appendix 5.17 "Logistic Regression Analysis Predicting WEP by Temperature Results"). The graph reveals significant overlap in temperature ranges for different event types, leading to high false positives and low recall. Consequently, logistic regression with temperature as the sole predictor is deemed insufficient for this classification task, suggesting the potential need for additional predictors, hyperparameter tuning, or alternative modeling approaches for improved performance.
Conclusion
In conclusion, the investigation into the correlation between temperature and wind characteristics, with the aim of modeling future temperature trends and variations, has yielded valuable insights within the scope of this project. Employing various regression techniques, the exploration delved into different sub-questions surrounding this overarching hypothesis. The results indicate that while initial models, particularly those based solely on wind parameters, exhibited limitations in predictive accuracy, the incorporation of multiple predictors through advanced regression analyses showcased a promising avenue for refinement. The comprehensive evaluation underscores the complexity of the relationship between temperature and wind characteristics, emphasizing the need for nuanced modeling approaches and consideration of additional factors to enhance the precision of temperature predictions over diverse temporal scales. Overall, this study provides a foundation for future research endeavors seeking to unravel the intricate dynamics between meteorological variables and advance our understanding of climate forecasting.
The visualization of the results of the binary classification can be found in appendix 5.18 "Methodology and Results Binary Classification" and displays four confusion matrices, each representing the performance of a different binary classification model: ExtraTrees, XGBoost, LightGBM, and RandomForest. While all models demonstrate high accuracy, with a significant majority of instances correctly classified, which is indicative of their ability to discriminate between the two classes effectively, the LGBM classifier shows the least number of Type II errors, signifying its strength in identifying true extreme weather events with minimal misses. Conversely, the XGBoost classifier presents with the lowest Type I errors, suggesting it is more conservative in predicting extreme weather, thus minimizing false alarms. In practical applications, Type II errors can be particularly critical as they represent missed predictions of extreme weather, which are crucial for timely warnings and safety measures. Therefore, the LGBM classifier might be preferred in scenarios where the cost of missing an actual extreme weather event is high. Each of these models offers a trade-off between sensitivity to detecting true events and specificity in avoiding false alarms, which needs to be carefully balanced according to the application's requirements and the consequences of prediction errors.
The classification reports found in appendix 5.18 provide an evaluation of the performance of different models. The ExtraTrees model demonstrates high precision and recall for both classes, achieving an accuracy of 99.30%. The precision, recall, and F1-score for both extreme weather events (0) and blue sky events (1) are consistently high, indicating robust performance across both classes. The XGBoost model exhibits excellent precision, recall, and F1-score for both classes, resulting in an overall accuracy of 99.40%. Similar to ExtraTrees, it shows strong performance in correctly classifying both extreme weather and blue sky events. The LightGBM model achieves a high accuracy of 99.37%, with impressive precision, recall, and F1-score for both classes. Notably, it maintains a high recall for extreme weather events (0), ensuring that a significant proportion of these events are correctly identified. The RandomForest model performs well, achieving an accuracy of 99.32%. It shows strong precision, recall, and F1-score for both extreme weather events (0) and blue sky events (1), indicating reliable performance across different weather scenarios. In summary, all four models—ExtraTrees, XGBoost, LightGBM, and RandomForest—demonstrate robust performance in classifying weather events, with high accuracy and consistent precision and recall metrics across the evaluated classes.
After successfully predicting extreme weather and blue sky day weather events, a key result of this research is the prediction of specific extreme weather events. Once it is determined that an observation is an extreme weather event, it is important to analyse what specific kind of extreme weather event it is. These results can then be used by scientists and governmental institutions to take countermeasures to prevent damage and minimize the risk of a weather event becoming hazardous. The analysis for the classification of specific weather events and patterns is conducted using multiclass classification techniques. The research question to be answered is: Is it possible to categorize and predict different extreme weather events based on multivariate weather data? This involves using multiclass classification algorithms. The result of this classification analysis is the prediction of certain weather events based on the current weather data and a model that was trained on historical weather data.
The multiclass classification is conducted using the models K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Decision Tree Classifier (DTC) and Gradient Boosting Classifier (GBC). These models have fundamentally different functionality so that the different model types can be compared with each other and strengths and weaknesses in the application to weather data can be assessed for each model type. The detailed results and visualisations for each model can be found in the appendix 5.19 "Methodology and Results Multiclass Classification".
KNN's multiclass classification performs well, aligning actual outcomes closely with predictions. The classification report highlights high precision and consistent recall, both with values between 78%-100% through all labels. The F1-score is strong for most classes, with a macro average of 0.88 and a weighted average of 0.92, demonstrating effectiveness despite class imbalance. The model's 92% accuracy underscores its reliability across diverse classes, showcasing robust performance in multiclass classification tasks.
The SVM displays a higher misclassification rate than KNN, particularly misclassifying Class 0 as Class 1. This discrepancy suggests challenges in distinguishing between these classes. The performance gap underscores the need to consider dataset characteristics when selecting a classification algorithm. The classification report indicates some performance variations. Precision for Class 0.0 decreases to 0.81, while recall for Class 1.0 improves to 0.69, leading to an increased F1-score of 0.52. Class 2.0 shows improved precision (0.78) but decreased recall (0.55), resulting in a slightly lower F1-score of 0.64. Class 4.0 sees increased precision (0.70) and a slight recall decrease (0.94), yielding a higher F1-score of 0.80. Macro-average precision and recall remain consistent at 0.75 and 0.77, contributing to a macro-average F1-score of 0.75. The weighted average F1-score is 0.82, indicating an overall improvement in balancing precision and recall with 82% accuracy.
The DTC excels in predicting various weather events, showing impressive performance across multiple metrics with high precision, recall, and F1-score. Particularly noteworthy is its perfect precision and recall for classes 3.0, 4.0, and 5.0. The overall accuracy of 95% highlights its effectiveness in classifying most instances. The Decision Tree's interpretability and simplicity, visualized through a decision tree plot, enhance transparency. However, in some scenarios, more advanced models may outperform it, and decision trees can be susceptible to overfitting.
The GBC confusion matrix highlights excellent performance, with accurate predictions for most labels. The classification report shows impressive precision, recall, and F1-scores across the diverse weather event classes, with precision above 94% and recall consistently between 92% and 100%, demonstrating the classifier's ability to identify instances accurately. The 98% overall accuracy underscores its proficiency, and compared to the prior models, the Gradient Boosting Classifier excels in both accuracy and balanced performance. Like a random forest it combines many decision trees, but it builds them sequentially, each tree correcting its predecessors, which helps it capture complex relationships in the data while controlling overfitting and makes it a robust choice for this classification task.
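The sequential boosting idea can be sketched as follows; the synthetic data and the hyperparameters (`n_estimators`, `learning_rate`, `max_depth`) are illustrative defaults, not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score

# Synthetic stand-in for the multiclass weather-event data (not ERA5)
X, y = make_classification(n_samples=2000, n_features=8, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# 100 shallow trees fitted sequentially, each correcting the previous ones
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=1)
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
print(f"Accuracy: {acc:.2f}")
```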
The analysis of the classification reports provides valuable insights into the performance of the different classifiers across the weather event labels. The Extra Trees, XGBoost, and Random Forest classifiers consistently demonstrate high precision, recall, and F1-scores across the weather event categories, showcasing their effectiveness in accurately predicting events, while the SVM tends to misclassify events more frequently. The GBC and DTC emerge as top performers, providing accurate predictions across a diverse range of weather event labels. Overall, the multiclass classification results are excellent, proving that extreme weather events can be predicted with very high accuracy using multiclass classification techniques.
The regression analyses aimed to predict temperature from historical data. Linear regression models achieved satisfactory accuracy when forecasting the general temperature trend over the year, and the Support Vector Regressor emerged as the most effective model. However, attempts to predict temperature from wind speed with simple linear regression, or from a mix of variables with multiple linear regression, were unsuccessful: the non-linear relationship and weak correlation between temperature and the wind variables motivated the shift to logistic regression and classification techniques. The SARIMAX model used for temperature and wind modeling exhibited a consistent bias, overestimating temperatures, which highlighted its limitations and prompted the search for alternative modeling approaches. The final regression analysis employed logistic regression to classify extreme weather and clear-sky events, but an approach based solely on temperature proved insufficient, underlining the need for more complex, multivariate methods to predict hazardous weather conditions accurately. Instead of optimizing logistic regression further, the focus shifted to identifying stronger binary classifiers in the subsequent classification analyses.
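The regressor comparison described above (the same four model families appear in the output cell further down) can be sketched as follows. A synthetic seasonal temperature signal stands in for the hourly ERA5 series, and the random train/test split is a simplification; a time-aware split would be more rigorous for forecasting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error

rng = np.random.default_rng(42)
# Hypothetical seasonal temperature signal (degrees C) with noise,
# standing in for the hourly ERA5 series
hours = np.arange(3000, dtype=float)
temp = 6.4 + 15 * np.sin(2 * np.pi * hours / (24 * 365)) + rng.normal(0, 3, hours.size)
X = hours.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, temp, test_size=0.2, random_state=42)

# SGD and SVR are scale-sensitive, so they get a StandardScaler in front
models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Stochastic Gradient Descent": make_pipeline(StandardScaler(),
                                                 SGDRegressor(random_state=42)),
    "Support Vector Regression": make_pipeline(StandardScaler(), SVR()),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = (mean_squared_error(y_test, pred),
                    mean_absolute_error(y_test, pred))
    print(f"{name}: MSE={scores[name][0]:.2f}  MAE={scores[name][1]:.2f}")
```

On a non-linear seasonal signal like this, the purely linear models are expected to trail the tree-based and kernel models, mirroring the trend differences reported in the text.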
In binary classification, the goal was to predict whether an observation was an extreme weather event or a blue sky day, answering the research question "Is it possible to classify and predict extreme weather events such as storms?". The analysis showed that extreme weather events can indeed be separated from blue sky days very accurately, and that both classes can be predicted with very high accuracy, precision, and recall. ExtraTreesClassifier, XGBClassifier, RandomForestClassifier, and LGBMClassifier were the top-performing classifiers in LazyClassifier's assessment; each achieved high accuracy, with XGBoost slightly leading the pack. These models proved effective at categorizing and predicting weather events from the given data, providing valuable tools for future weather prediction. The results of this binary analysis then fed into the multiclass classification to determine the specific type of extreme weather event.
The multiclass classification further refined the understanding of the various weather events. The goal was to determine and classify the specific type of extreme weather event, answering the research question "Is it possible to categorize and predict different extreme weather events based on multivariate weather data?". The answer is yes: the prediction and categorization of various extreme weather events is possible with very high accuracy, precision, and recall. Gradient boosting emerged as a particularly potent method, achieving high precision, recall, and F1-scores across all classes. This success illustrates the potential of sophisticated classification algorithms for deciphering complex weather patterns and predicting diverse weather events. This knowledge can also support further research and governmental institutions, e.g., in taking countermeasures to prevent damage from certain extreme weather events and to minimize the associated risks and dangers.
This project delved into regression and classification analyses of weather data for Bancroft, Ontario, offering insights into atmospheric dynamics. The absence of linear correlation between the wind and temperature variables, revealed in the EDA, could have led to discontinuation, but the value found in the literature motivated persistence. The approach, including PCA and feature selection, produced interesting results that add to the scientific discourse.

However, the regional bias of the data and the irregular nature of meteorological phenomena underline the difficulty of making precise predictions. While the analyses yielded valuable insights, further optimization, including hyperparameter tuning, remains a potential avenue, and exploring weather patterns in relation to climate change could deepen the understanding while acknowledging potential sources of variance and error. Recognizing these limitations and the external factors influencing weather trends adds humility to the findings and encourages future researchers to explore additional dimensions.

In summary, this project contributes to the weather prediction discourse by highlighting the need for multidimensional approaches and the potential of machine learning techniques. As climate variability poses growing challenges, these insights pave the way for more accurate and comprehensive forecasting methods: integrating diverse datasets, refining models, and exploring new methodologies are crucial for better forecasting, strategic planning, and preparedness across sectors facing weather and climate change impacts.
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65345 entries, 0 to 65344
Columns: 186 entries, Unnamed: 0 to wind_direction_label
dtypes: datetime64[ns](2), float64(167), int64(8), object(9)
memory usage: 92.7+ MB
```
| | count | mean | min | 25% | 50% | 75% | max | std |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 65345.0 | 32685.658321 | 0.0 | 16343.0 | 32689.0 | 49025.0 | 65361.0 | 18867.701277 |
| run_datetime | 65345 | 2019-04-06 14:09:11.362766848 | 2015-07-15 00:00:00 | 2017-05-25 23:00:00 | 2019-04-07 01:00:00 | 2021-02-14 16:00:00 | 2022-12-27 08:00:00 | NaN |
| valid_datetime | 65345 | 2019-04-06 14:09:11.362766848 | 2015-07-15 00:00:00 | 2017-05-25 23:00:00 | 2019-04-07 01:00:00 | 2021-02-14 16:00:00 | 2022-12-27 08:00:00 | NaN |
| horizon | 65345.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| avg_temp | 65345.0 | 279.574328 | 243.849393 | 271.114219 | 279.882735 | 289.903226 | 300.934144 | 11.383325 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| label2 | 12712.0 | 3.06191 | 0.0 | 1.0 | 3.0 | 5.0 | 6.0 | 2.126446 |
| label3 | 65345.0 | 1.1811 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 | 0.740687 |
| year | 65345.0 | 2018.745535 | 2015.0 | 2017.0 | 2019.0 | 2021.0 | 2022.0 | 2.162032 |
| month | 65345.0 | 6.711852 | 1.0 | 4.0 | 7.0 | 10.0 | 12.0 | 3.446477 |
| avg_temp_celsius | 65345.0 | 6.424328 | -29.300607 | -2.035781 | 6.732735 | 16.753226 | 27.784144 | 11.383325 |
177 rows × 8 columns
| | run_datetime | wep | avg_temp | avg_temp_celsius | min_wet_bulb_temp | avg_dewpoint | avg_temp_change | avg_windspd | max_windgust | avg_winddir | ... | avg_winddir_cos | wind_direction_label | max_cumulative_precip | max_snow_density_6 | max_cumulative_snow | max_cumulative_ice | avg_pressure_change | label0 | label1 | label2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-07-15 00:00:00 | Blue sky day | 287.389224 | 14.239224 | 280.809506 | 280.735246 | NaN | 3.386380 | 14.899891 | 80.302464 | ... | 0.190676 | East | 2.009 | 0.0 | 0.000 | 0.0 | 52.892217 | 0 | 1 | NaN |
| 1 | 2015-07-15 01:00:00 | Blue sky day | 287.378997 | 14.228997 | 280.809506 | 280.414058 | -0.010227 | 3.326687 | 14.899891 | 76.866373 | ... | 0.102466 | East | 1.209 | 0.0 | 0.000 | 0.0 | 50.256685 | 0 | 1 | NaN |
| 2 | 2015-07-15 02:00:00 | Blue sky day | 287.388845 | 14.238845 | 280.809506 | 280.187074 | 0.009848 | 3.243494 | 14.899891 | 76.258867 | ... | 0.651950 | East | 0.400 | 0.0 | 0.000 | 0.0 | 47.944054 | 3 | 1 | NaN |
| 3 | 2015-07-15 03:00:00 | Blue sky day | 287.427324 | 14.277324 | 280.809506 | 280.049330 | 0.038479 | 3.145505 | 14.899891 | 78.299616 | ... | -0.971290 | East | 0.000 | 0.0 | 0.000 | 0.0 | 45.855264 | 2 | 1 | NaN |
| 4 | 2015-07-15 04:00:00 | Blue sky day | 287.489158 | 14.339158 | 280.809506 | 279.980697 | 0.061834 | 3.047607 | 14.702229 | 84.632852 | ... | -0.981976 | East | 0.000 | 0.0 | 0.000 | 0.0 | 44.823453 | 2 | 1 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65340 | 2022-12-27 04:00:00 | Moderate rain | 264.241641 | -8.908359 | 260.284794 | 262.061976 | -0.124561 | 1.962197 | 8.444256 | 232.606824 | ... | 0.991695 | Southwest | 2.126 | 0.0 | 25.643 | 0.0 | NaN | 5 | 0 | 3.0 |
| 65341 | 2022-12-27 05:00:00 | Blue sky day | 264.115391 | -9.034609 | 260.284794 | 262.114357 | -0.126250 | 1.978823 | 7.475906 | 229.938704 | ... | -0.823955 | Southwest | 2.226 | 0.0 | 21.161 | 0.0 | NaN | 5 | 1 | NaN |
| 65342 | 2022-12-27 06:00:00 | Blue sky day | 264.024853 | -9.125147 | 260.284794 | 262.206179 | -0.090537 | 2.005855 | 7.305549 | 227.024163 | ... | 0.675251 | Southwest | 2.426 | 0.0 | 16.430 | 0.0 | NaN | 5 | 1 | NaN |
| 65343 | 2022-12-27 07:00:00 | Blue sky day | 264.048368 | -9.101632 | 260.284794 | 262.350025 | 0.023514 | 2.040978 | 7.305549 | 223.900355 | ... | -0.662027 | Southwest | 2.826 | 0.0 | 10.859 | 0.0 | NaN | 5 | 1 | NaN |
| 65344 | 2022-12-27 08:00:00 | Blue sky day | 263.918722 | -9.231278 | 260.284794 | 262.512490 | -0.129646 | 2.078741 | 6.818578 | 220.894487 | ... | 0.554528 | Southwest | 3.426 | 0.0 | 5.640 | 0.0 | NaN | 0 | 1 | NaN |
65345 rows × 21 columns
| | Name | Description | Role | Type | Format |
|---|---|---|---|---|---|
| 0 | run_datetime | Date and time when the weather observations we... | ID / predictor | numerical continuous / ID | <class 'pandas._libs.tslibs.timestamps.Timesta... |
| 1 | wep | Weather Event Type (WEP) is a categorization o... | response | categorical nominal | <class 'str'> |
| 2 | avg_temp | The average temperature measured at two meters... | response / predictor | numerical continuous | <class 'numpy.float64'> |
| 3 | min_wet_bulb_temp | Minimum wet bulb temperature recorded during t... | predictor | numerical continuous | <class 'numpy.float64'> |
| 4 | avg_dewpoint | Average dewpoint temperature observed during t... | predictor | numerical continuous | <class 'numpy.float64'> |
| 5 | avg_temp_change | Average change in temperature during the obser... | predictor | numerical continuous | <class 'numpy.float64'> |
| 6 | avg_windspd | Average wind speed measured during the recordi... | predictor | numerical continuous | <class 'numpy.float64'> |
| 7 | max_windgust | Maximum wind gust observed during the recordin... | predictor | numerical continuous | <class 'numpy.float64'> |
| 8 | avg_winddir | Average wind direction (in degree) observed du... | predictor | numerical continuous | <class 'numpy.float64'> |
| 9 | wind_direction_label | Wind direction (in cardinal direction) observe... | predictor | categorical ordinal | <class 'str'> |
| 10 | max_cumulative_precip | Maximum cumulative precipitation recorded, con... | predictor | numerical continuous | <class 'numpy.float64'> |
| 11 | max_snow_density_6 | Maximum snow density at a depth of 6 inches, c... | predictor | numerical continuous | <class 'numpy.float64'> |
| 12 | max_cumulative_snow | Maximum cumulative snow recorded, considering ... | predictor | numerical continuous | <class 'numpy.float64'> |
| 13 | max_cumulative_ice | Maximum cumulative ice recorded, considering a... | predictor | numerical continuous | <class 'numpy.float64'> |
| 14 | avg_pressure_change | Average change in atmospheric pressure during ... | predictor | numerical continuous | <class 'numpy.float64'> |

```
X_train shape: (52276, 1)   X_test shape: (13069, 1)
y_train shape: (52276,)     y_test shape: (13069,)

Linear Regression Model:            Mean Squared Error: 127.78   Mean Absolute Error: 9.63
Gradient Boosting Model:            Mean Squared Error: 127.72   Mean Absolute Error: 9.63
Stochastic Gradient Descent Model:  Mean Squared Error: 127.78   Mean Absolute Error: 9.63
Support Vector Regression Model:    Mean Squared Error: 129.85   Mean Absolute Error: 9.57
```
```
RUNNING THE L-BFGS-B CODE (SARIMAX fit; condensed solver log)
Machine precision = 2.220D-16   N = 9   M = 10
This problem is unconstrained; at X0, 0 variables are exactly at the bounds.
At iterate  0: f = -2.33934D+00   |proj g| = 2.27869D+01
At iterate 50: f = -2.42344D+00   |proj g| = 9.01914D-02
Tit (total iterations) = 50   Tnf (function evaluations) = 59
F (final function value) = -2.4234367498076992
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT
```

```
Optimization terminated successfully.
Current function value: 0.375341
Iterations 7
AIC: 39246.63967824996
```

Label 0: Extreme Weather Event
Label 1: Blue Sky Day

Classification Report:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.39 | 0.20 | 0.26 | 2515 |
| 1 | 0.83 | 0.93 | 0.88 | 10554 |
| accuracy | | | 0.79 | 13069 |
| macro avg | 0.61 | 0.56 | 0.57 | 13069 |
| weighted avg | 0.75 | 0.79 | 0.76 | 13069 |
```
ExtraTrees Accuracy:   0.9931899915831357
XGBoost Accuracy:      0.9946438136047134
LGBM Accuracy:         0.9935725763256561
RandomForest Accuracy: 0.9934195424286479
```
[^1]: Liljequist, G.H. / Cehak, K. (1984): Allgemeine Meteorologie. 3. Auflage, Springer-Verlag.
[^2]: The contribution of weather forecast information to agriculture, water, and energy sectors in East and West Africa.
[^3]: ECMWF (2023a): ERA5: data documentation. URL: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation
[^4]: A Hybrid Dataset of Historical Cool-Season Lake Effects From the Eastern Great Lakes of North America.
[^5]: Hjelmfelt, M.R. (1990): Numerical study of the influence of environmental conditions on lake-effect snowstorms over Lake Michigan, in: Monthly Weather Review, 118(1), pp. 138-150.
[^6]: de Lima, G.R.T. / Stephan, S. (2013): A new classification approach for detecting severe weather patterns, in: Computers & Geosciences, 57, pp. 158-165.
[^7]: ECMWF (2023b): ERA5: data documentation - Parameter listings. URL: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Parameterlistings
[^8]: Scikit-learn (2023): URL: https://scikit-learn.org/stable/documentation.html
[^9]: Hastie, T. / Tibshirani, R. / Friedman, J. (2009): The Elements of Statistical Learning.
[^10]: Gregor, S. / Hevner, A.R. (2013): Positioning and Presenting Design Science Research for Maximum Impact, in: MIS Quarterly, 37(2), pp. 337-355; Hevner, A. / Chatterjee, S. (2010): Design Research in Information Systems: Theory and Practice. Ed. by R. Sharda / S. Voß. Vol. 22, Integrated Series in Information Systems. New York, NY, USA: Springer; Hevner, A. / March, S.T. / Park, J. / Ram, S. (2004): Design Science in Information Systems Research, in: MIS Quarterly, 28(1), pp. 75-105.
[^11]: Wilde, T. / Hess, T. (2007): Forschungsmethoden der Wirtschaftsinformatik, in: Wirtschaftsinformatik, 49(4), pp. 280-287; Goldman, N. / Narayanaswamy, K. (1992): Software evolution through iterative prototyping, in: Proceedings of the 14th International Conference on Software Engineering, pp. 158-172.
[^12]: Reflective physical prototyping through integrated design, test, and analysis.
[^13]: Design Science in Information Systems Research.
[^14]: Shao, J. (1993): Linear model selection by cross-validation, in: Journal of the American Statistical Association, pp. 486-494; Browne, M.W. (2000): Cross-validation methods, in: Journal of Mathematical Psychology, 44(1), pp. 108-132.
[^15]: Webster, J. / Watson, R.T. (2002): Analyzing the past to prepare for the future: Writing a literature review, in: MIS Quarterly, 26(2), pp. xiii-xxiii.
[^16]: Amat Rodrigo, J. / Escobar Ortiz, J. (n.d.): Forecasting SARIMAX and ARIMA models - Skforecast Docs. URL: https://joaquinamatrodrigo.github.io/skforecast/0.7.0/user_guides/forecasting-sarimax-arima.html
[^17]: Prabhakaran, S. (2022): Augmented Dickey Fuller Test (ADF Test) - must read guide, Machine Learning Plus. URL: https://www.machinelearningplus.com/time-series/augmented-dickey-fuller-test/
[^18]: Zach (2021): How to calculate AIC of regression models in Python, Statology. URL: https://www.statology.org/aic-in-python/

Fathi, M. / Haghi Kashani, M. / Jameii, S.M. / Mahdipour, E. (2022): Big Data Analytics in Weather Forecasting: A Systematic Review, in: Archives of Computational Methods in Engineering, 29(2), pp. 1247-1275.
Ghirardelli, J.E. (2005): An Overview of the Redeveloped Localized Aviation MOS Program (LAMP) for Short-Range Forecasting.